Before you begin, you should install all of the relevant R packages. Once installed, our packages can be loaded as follows.
library(cytomapper)
library(dplyr)
library(ggplot2)
library(simpleSeg)
library(FuseSOM)
library(ggpubr)
library(scater)
library(spicyR)
library(ClassifyR)
library(scFeatures)
library(lisaClust)
It is convenient to set the number of cores for running code in parallel. Please choose a number that is appropriate for your resources.
nCores <- 40
BPPARAM <- simpleSeg:::generateBPParam(nCores)
theme_set(theme_classic())
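If you are unsure how many cores your machine has, a conservative value can be derived from the machine itself. The following is a minimal sketch using only the base parallel package; the name nCoresSuggested and the choice to leave one core free are our assumptions for illustration, not requirements of any package used here.

```r
# Detect the number of cores; detectCores() can return NA on some platforms.
available <- parallel::detectCores()
if (is.na(available)) available <- 1L

# Leave one core free for the rest of the system (an arbitrary choice).
nCoresSuggested <- max(1L, available - 1L)
nCoresSuggested
```

On a shared server you would normally pick a smaller number than this by hand.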
In the following we will reanalyse some MIBI-TOF data (Risom et al, 2022) profiling the spatial landscape of ductal carcinoma in situ (DCIS), which is a pre-invasive lesion that is thought to be a precursor to invasive breast cancer (IBC). A key conclusion of this manuscript (amongst others) is that spatial information about cells can be used to predict disease progression in patients. We will use our spicy workflow to reach a similar conclusion.
The R code for this analysis is available on GitHub at https://github.com/SydneyBioX/spicyWorkflow. A mildly processed version of the data used in the manuscript is available in this repository.
The images are stored in the images folder within the Data folder. Here we use readImage() from the EBImage package to read these into R. If memory is a restricting factor, and the files are in a slightly different format, you could use loadImages() from the cytomapper package to load all of the tiff images into a CytoImageList object, which can store the images as h5 on-disk.
pathToImages <- "Data/images"
# Get directories of images
imageDirs <- dir(pathToImages, full.names = TRUE)
names(imageDirs) <- dir(pathToImages, full.names = FALSE)
# Get files in each directory
files <- sapply(imageDirs, list.files, pattern = "tif", full.names = TRUE, simplify = FALSE)
# Read files with readImage from EBImage
images <- lapply(files, EBImage::readImage, as.is = TRUE)
We will make use of the on_disk option to convert our images to a CytoImageList with the images not held in memory.
# Store images in a CytoImageList with images on_disk as h5 files to save memory.
dir.create("Data/h5Files")
images <- cytomapper::CytoImageList(images,
on_disk = TRUE,
h5FilesPath = "Data/h5Files",
BPPARAM = BPPARAM)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 15965561 852.7 24564655 1311.9 24564655 1311.9
## Vcells 26145971 199.5 2046624540 15614.6 1750709221 13356.9
To associate features in our images with disease progression, it is important to read in information which links image identifiers to their progression status. We will do this here, making sure that our imageIDs match.
## Read the clinical data
# Read in clinical data, manipulate imageID and select columns
clinical <- read.csv("Data/1-s2.0-S0092867421014860-mmc1.csv")
clinical <- clinical |>
mutate(imageID = paste0("Point", PointNumber, "_pt", Patient_ID, "_", TMAD_Patient))
clinical$imageID[grep("normal", clinical$Tissue_Type)] <- paste0(clinical$imageID[grep("normal", clinical$Tissue_Type)], "_Normal")
clinicalVariables <- c("imageID", "Patient_ID","Status", "Age", "SUBTYPE", "PAM50", "Treatment", "DCIS_grade", "Necrosis")
rownames(clinical) <- clinical$imageID
We can then store the clinical information in the mcols of the CytoImageList.
# Add the clinical data to mcols of images.
mcols(images) <- clinical[names(images), clinicalVariables]
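Indexing the clinical table by image name silently produces rows of NA for any image without a matching record, so it can be worth checking the match first. The following is a minimal sketch in base R; the identifiers and the objects imageNames and clinicalToy are hypothetical stand-ins for names(images) and clinical.

```r
# Toy stand-ins for names(images) and the clinical table (hypothetical IDs).
imageNames <- c("Point1_pt10_100", "Point2_pt11_101", "Point3_pt12_102")
clinicalToy <- data.frame(
  imageID = c("Point1_pt10_100", "Point2_pt11_101", "Point4_pt13_103"),
  Status = c("progressor", "nonprogressor", "progressor")
)
rownames(clinicalToy) <- clinicalToy$imageID

# Images with no clinical record would give rows of NA when indexed by name.
missingClinical <- setdiff(imageNames, rownames(clinicalToy))
missingClinical

# Keep only the images that can be matched.
matched <- intersect(imageNames, rownames(clinicalToy))
clinicalToy[matched, ]
```

An empty missingClinical indicates that every image has clinical information attached.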
Our simpleSeg R package on https://github.com/SydneyBioX/simpleSeg provides a series of functions to generate simple segmentation masks of images. These functions leverage the functionality of the EBImage package on Bioconductor. For more flexibility when performing your segmentation in R we recommend learning to use the EBImage package. A key strength of the simpleSeg package is that we have coded multiple ways to perform some simple segmentation operations as well as incorporating multiple automatic procedures to optimise some key parameters when these aren’t specified.
If your images are stored in a list or CytoImageList they can be segmented with a simple call to simpleSeg(). Here we have asked simpleSeg to do multiple things. First, we would like to use a combination of principal component analysis of all channels guided by the HH3 channel to summarise the nuclei signal in the images. Secondly, to estimate the cell body of the cells we will simply dilate out from the nuclei by 2 pixels. We have also requested that the channels be square root transformed and that a minimum cell size of 40 pixels be used as a size selection step.
# Generate segmentation masks
masks <- simpleSeg(images,
nucleus = c("PCA", "HH3"),
cellBody = "dilate",
transform = "sqrt",
sizeSelection = 40,
discSize = 2,
cores = nCores)
The display and colorLabels functions in EBImage make it very easy to examine the performance of the cell segmentation. The great thing about display is that if used in an interactive session it is very easy to zoom in and out of the image.
# Visualise segmentation performance one way.
EBImage::display(colorLabels(masks[[1]]))
The plotPixels function in cytomapper makes it easy to overlay the masks on top of the intensities of 5 markers. Here we can see that the segmentation appears to be performing reasonably.
# Visualise segmentation performance another way.
cytomapper::plotPixels(image = images[1],
mask = masks[1],
img_id = "imageID",
colour_by = c("PanKRT", "GLUT1", "HH3", "CD3", "CD20"),
display = "single",
colour = list(HH3 = c("black","blue"),
CD3 = c("black","purple"),
CD20 = c("black","green"),
GLUT1 = c("black", "red"),
PanKRT = c("black", "yellow")),
bcg = list(HH3 = c(0, 1, 1.5),
CD3 = c(0, 1, 1.5),
CD20 = c(0, 1, 1.5),
GLUT1 = c(0, 1, 1.5),
PanKRT = c(0, 1, 1.5)),
legend = NULL)
In order to characterise the phenotypes of each of the segmented cells, measureObjects from cytomapper will calculate the average intensity of each channel within each cell as well as a few morphological features. The channel intensities will be stored in the counts assay in a SingleCellExperiment. Information on the spatial location of each cell is stored in colData in the m.cx and m.cy columns. In addition to this, it will propagate the information we have stored in the mcols of our CytoImageList to the colData of the resulting SingleCellExperiment.
# Summarise the expression of each marker in each cell
cells <- cytomapper::measureObjects(masks,
images,
img_id = "imageID",
BPPARAM = BPPARAM)
We should check to see if the marker intensities of each cell require some form of transformation or normalisation. Here we extract the intensities from the counts assay. Looking at PanKRT which should be expressed in the majority of the tumour cells, the intensities are clearly very skewed.
# Extract marker data and bind with information about images
df <- as.data.frame(cbind(colData(cells), t(assay(cells, "counts"))))
# Plots densities of PanKRT for each image.
ggplot(df, aes(x = PanKRT, colour = imageID)) +
geom_density() +
theme(legend.position = "none")
We can transform and normalise our data using the normalizeCells function. Here we have taken the intensities from the counts assay, performed a square root transform, then for each image trimmed at the 99th percentile and min-max scaled to 0-1. This modified data is then stored in the norm assay by default. We can see that this normalised data appears more bimodal, not perfect, but likely sufficient for clustering.
# Transform and normalise the marker expression of each cell type.
# Use a square root transform, then trim at the 99th percentile and min-max scale.
cells <- normalizeCells(cells,
transformation = "sqrt",
method = c("trim99", "minMax"),
assayIn = "counts",
cores = nCores)
# Extract normalised marker information.
df <- as.data.frame(cbind(colData(cells), t(assay(cells, "norm"))))
# Plots densities of normalised PanKRT for each image.
ggplot(df, aes(x = PanKRT, colour = imageID)) +
geom_density() +
theme(legend.position = "none")
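To make the three normalisation steps concrete, here is a minimal base R sketch applied to simulated intensities. It is only an illustration of the idea under our own assumptions (the helper normaliseSketch and the toy data are hypothetical), and it is not claimed to reproduce what normalizeCells does internally.

```r
# Square root transform, cap at the 99th percentile, then min-max scale to 0-1.
normaliseSketch <- function(x) {
  x <- sqrt(x)
  cap <- unname(quantile(x, 0.99, na.rm = TRUE))
  x <- pmin(x, cap)
  rng <- range(x, na.rm = TRUE)
  if (diff(rng) == 0) return(rep(0, length(x)))
  (x - rng[1]) / diff(rng)
}

# Simulated skewed intensities for two hypothetical images.
set.seed(1)
toy <- data.frame(
  imageID = rep(c("A", "B"), each = 100),
  PanKRT = rexp(200, rate = 0.1)
)

# Apply the normalisation separately within each image.
toy$PanKRT_norm <- ave(toy$PanKRT, toy$imageID, FUN = normaliseSketch)
tapply(toy$PanKRT_norm, toy$imageID, range)
```

Scaling within each image, rather than across all cells, is what removes image-to-image differences in overall intensity.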
Our FuseSOM R package on https://github.com/ecool50/FuseSOM provides a pipeline for the clustering of highly multiplexed in situ imaging cytometry assays. This pipeline uses the Self Organizing Map architecture coupled with Multiview hierarchical clustering and provides functions for the estimation of the number of clusters.
Here we cluster using the runFuseSOM function. We have chosen to specify the same subset of markers used in the original manuscript for gating cell types. We have also specified the number of clusters to identify to be 20.
# The markers used in the original publication to gate cell types.
useMarkers <- c("PanKRT", "ECAD", "CK7", "VIM", "FAP", "CD31", "CK5", "SMA",
"CD45", "CD4", "CD3", "CD8", "CD20", "CD68", "CD14", "CD11c",
"HLADRDPDQ", "MPO", "Tryptase")
# Set seed.
set.seed(51773)
# Generate SOM and cluster cells into 20 groups.
cells <- runFuseSOM(cells,
markers = useMarkers,
assay = 'norm',
numClusters = 20)
We can begin the process of understanding what each of these cell clusters are by using the plotGroupedHeatmap function from scater. At the least, here we can see we capture all of the major immune populations that we expect to see.
# Visualise marker expression in each cluster.
scater::plotGroupedHeatmap(cells,
features = useMarkers,
group = "clusters",
exprs_values = "norm",
center = TRUE,
scale = TRUE,
zlim = c(-3,3),
cluster_rows = FALSE)
We can check to see how reasonable our choice of 20 clusters is using the estimateNumCluster and optiPlot functions. Here we examine the Gap method; others such as Silhouette and Within Cluster Distance are also available.
# Generate metrics for estimating the number of clusters.
# As I've already run runFuseSOM I don't need to run generateSOM().
cells <- estimateNumCluster(cells, kseq = 2:30)
optiPlot(cells, method = "gap")
# Check cluster frequencies.
colData(cells)$clusters |>
table() |>
sort()
##
## cluster_2 cluster_5 cluster_16 cluster_10 cluster_7 cluster_18 cluster_11
## 351 361 378 418 570 666 702
## cluster_6 cluster_13 cluster_3 cluster_14 cluster_4 cluster_20 cluster_12
## 869 971 1090 1405 1484 1626 1767
## cluster_15 cluster_8 cluster_19 cluster_17 cluster_9 cluster_1
## 2148 3396 3612 3625 5418 31703
We recommend using a package such as diffcyt for testing for changes in abundance of cell types. However, the colTest function allows us to quickly test for associations between the proportions of the cell types and progression status using either Wilcoxon rank sum tests or t-tests. Here one cluster has a p-value less than 0.05, but this does not correspond to a small false discovery rate (FDR) after adjusting for multiple testing.
# Select cells which belong to individuals with progressor status.
cellsToUse <- cells$Status%in%c("nonprogressor", "progressor")
# Perform simple Wilcoxon rank sum tests on the columns of the proportion matrix.
testProp <- colTest(cells[, cellsToUse],
condition = "Status",
feature = "clusters")
testProp
## W pval adjPval cluster
## cluster_2 190 0.024 0.48 cluster_2
## cluster_13 390 0.130 0.58 cluster_13
## cluster_6 230 0.150 0.58 cluster_6
## cluster_17 380 0.170 0.58 cluster_17
## cluster_14 380 0.180 0.58 cluster_14
## cluster_1 240 0.200 0.58 cluster_1
## cluster_15 370 0.240 0.58 cluster_15
## cluster_11 370 0.250 0.58 cluster_11
## cluster_8 240 0.260 0.58 cluster_8
## cluster_12 360 0.340 0.68 cluster_12
## cluster_9 360 0.380 0.69 cluster_9
## cluster_5 350 0.440 0.73 cluster_5
## cluster_16 280 0.590 0.83 cluster_16
## cluster_18 280 0.600 0.83 cluster_18
## cluster_10 280 0.620 0.83 cluster_10
## cluster_20 290 0.790 0.92 cluster_20
## cluster_19 320 0.810 0.92 cluster_19
## cluster_3 320 0.830 0.92 cluster_3
## cluster_7 300 0.910 0.96 cluster_7
## cluster_4 300 0.960 0.96 cluster_4
imagesToUse <- rownames(clinical)[clinical[, "Status"]%in%c("nonprogressor", "progressor")]
prop <- getProp(cells, feature = "clusters")
clusterToUse <- rownames(testProp)[1]
boxplot( prop[imagesToUse, clusterToUse] ~ clinical[imagesToUse, "Status"] )
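The gap between a raw p-value and an adjusted one can be seen in a small base R sketch. Everything here is simulated and hypothetical (cellsToy, status, propToy), and we assume a Benjamini-Hochberg adjustment purely for illustration; this is not a description of how colTest is implemented.

```r
set.seed(1)
# Simulate 2000 cells spread over 20 images and 4 clusters.
cellsToy <- data.frame(
  imageID = sample(paste0("img", 1:20), 2000, replace = TRUE),
  cluster = sample(paste0("cluster_", 1:4), 2000, replace = TRUE)
)
# Hypothetical progression status for each image.
status <- setNames(rep(c("nonprogressor", "progressor"), each = 10),
                   paste0("img", 1:20))

# Proportion of each cluster within each image (rows sum to 1).
propToy <- prop.table(table(cellsToy$imageID, cellsToy$cluster), margin = 1)
group <- factor(status[rownames(propToy)])

# One Wilcoxon rank sum test per cluster, then adjust for multiple testing.
pvals <- apply(propToy, 2, function(p) {
  wilcox.test(p ~ group, exact = FALSE)$p.value
})
adjPvals <- p.adjust(pvals, method = "BH")
data.frame(pval = pvals, adjPval = adjPvals)
```

Because the adjusted values are never smaller than the raw ones, a single nominally significant cluster among many tests can easily lose significance after adjustment.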
As our data is stored in a SingleCellExperiment we can also use scater to perform dimension reduction and visualise our data in a lower dimension to look for image or cluster differences.
set.seed(51773)
# Perform dimension reduction using UMAP.
cells <- scater::runUMAP(cells,
subset_row = useMarkers,
exprs_values = "norm")
# Select a subset of images to plot.
someImages <- unique(colData(cells)$imageID)[c(1,10,20,40,50,60)]
# UMAP by imageID.
scater::plotReducedDim(cells[,colData(cells)$imageID %in% someImages], dimred="UMAP", colour_by="imageID")
# UMAP by cell type cluster.
scater::plotReducedDim(cells[,colData(cells)$imageID %in% someImages], dimred="UMAP", colour_by="clusters")
Our spicyR package (https://www.bioconductor.org/packages/devel/bioc/html/spicyR.html) provides a series of functions to aid in the analysis of both immunofluorescence and mass cytometry imaging data as well as other assays that can deeply phenotype individual cells and their spatial location. Here we use the spicy function to test for changes in the spatial relationships between pairwise combinations of cells. We quantify spatial relationships using a combination of three radii Rs = c(20, 50, 100) and mildly account for some of the global tissue structure using sigma = 50.
# Test for changes in pairwise spatial relationships between cell types.
spicyTest <- spicy(cells[, cellsToUse],
condition = "Status",
cellType = "clusters",
imageID = "imageID",
spatialCoords = c("m.cx", "m.cy"),
Rs = c(20, 50, 100),
sigma = 50,
BPPARAM = BPPARAM)
topPairs(spicyTest, n = 10)
## intercept coefficient p.value adj.pvalue from
## cluster_18__cluster_17 57.47863 103.82184 0.004269211 0.6412617 cluster_18
## cluster_17__cluster_18 56.23307 92.78098 0.006867987 0.6412617 cluster_17
## cluster_20__cluster_8 -70.27764 60.25065 0.009107886 0.6412617 cluster_20
## cluster_16__cluster_15 -141.08275 56.94516 0.009162453 0.6412617 cluster_16
## cluster_17__cluster_7 21.28862 107.32206 0.012743804 0.6412617 cluster_17
## cluster_8__cluster_20 -60.10460 60.04871 0.013838365 0.6412617 cluster_8
## cluster_15__cluster_16 -142.88187 48.39057 0.014461077 0.6412617 cluster_15
## cluster_6__cluster_6 192.73531 193.59265 0.015132520 0.6412617 cluster_6
## cluster_7__cluster_17 28.22911 107.75547 0.015205549 0.6412617 cluster_7
## cluster_9__cluster_5 -66.44860 75.52276 0.021968425 0.6412617 cluster_9
## to
## cluster_18__cluster_17 cluster_17
## cluster_17__cluster_18 cluster_18
## cluster_20__cluster_8 cluster_8
## cluster_16__cluster_15 cluster_15
## cluster_17__cluster_7 cluster_7
## cluster_8__cluster_20 cluster_20
## cluster_15__cluster_16 cluster_16
## cluster_6__cluster_6 cluster_6
## cluster_7__cluster_17 cluster_17
## cluster_9__cluster_5 cluster_5
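The intuition behind quantifying a pairwise spatial relationship at several radii can be illustrated with a short base R sketch. This is a deliberately simplified neighbour count on simulated points, with no edge correction and no adjustment for tissue structure; it is not the statistic that spicy computes, and all object names are hypothetical.

```r
set.seed(1)
# Simulate 300 cells of two types scattered uniformly in a 500 x 500 window.
n <- 300
toy <- data.frame(
  x = runif(n, 0, 500),
  y = runif(n, 0, 500),
  type = sample(c("A", "B"), n, replace = TRUE)
)

# Pairwise distances between all cells.
d <- as.matrix(dist(toy[, c("x", "y")]))
isA <- toy$type == "A"
isB <- toy$type == "B"

# Average number of type B cells within each radius of a type A cell.
Rs <- c(20, 50, 100)
observed <- sapply(Rs, function(r) {
  mean(rowSums(d[isA, isB, drop = FALSE] <= r))
})

# Expected count if type B cells were placed at random (ignoring edge effects).
expected <- sum(isB) / 500^2 * pi * Rs^2
data.frame(radius = Rs, observed = observed, expected = expected)
```

Observed counts well above the expected values would suggest attraction between the two cell types at that radius, and counts well below would suggest avoidance.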
We can visualise these tests using signifPlot, where we observe that cell type pairs appear to become less attracted to each other (or to avoid each other more) in the progressor samples.
# Visualise which relationships are changing the most.
signifPlot(spicyTest,
breaks = c(-1.5, 3, 0.5))
Our lisaClust package (https://www.bioconductor.org/packages/devel/bioc/html/lisaClust.html) provides a series of functions to identify and visualise regions of tissue where spatial associations between cell types are similar. This package can be used to provide a high-level summary of cell-type colocalization in multiplexed imaging data that has been segmented at a single-cell resolution. Here we use the lisaClust function to cluster cells into 5 regions with distinct spatial ordering.
set.seed(51773)
# Cluster cells into spatial regions with similar composition.
cells <- lisaClust(cells,
k = 5,
Rs = c(20, 50, 100),
sigma = 50,
spatialCoords = c("m.cx", "m.cy"),
cellType = "clusters",
BPPARAM = BPPARAM)
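As a rough intuition for region finding, cells can be grouped by the composition of their local neighbourhood. The base R sketch below substitutes a much simpler technique than lisaClust: it counts neighbours of each type within a single radius and applies k-means. The simulated data, the radius of 50 and the choice of 2 regions are all assumptions made for illustration.

```r
set.seed(1)
# Simulate a tissue where type A dominates the left half and type B the right.
n <- 400
toy <- data.frame(x = runif(n, 0, 500), y = runif(n, 0, 500))
toy$type <- ifelse(
  toy$x < 250,
  sample(c("A", "B"), n, replace = TRUE, prob = c(0.8, 0.2)),
  sample(c("A", "B"), n, replace = TRUE, prob = c(0.2, 0.8))
)

# For each cell, count neighbours of each type within 50 units (excluding itself).
d <- as.matrix(dist(toy[, c("x", "y")]))
near <- (d <= 50 & d > 0) + 0
typeMat <- cbind(A = toy$type == "A", B = toy$type == "B") + 0
neighbourCounts <- near %*% typeMat

# Group cells with similar neighbourhood composition into 2 regions.
regions <- kmeans(neighbourCounts, centers = 2, nstart = 10)$cluster
table(region = regions, leftHalf = toy$x < 250)
```

The cross-tabulation shows how well the recovered regions line up with the two simulated halves of the tissue.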
We can try to interpret which spatial orderings the regions are quantifying using the regionMap function. This plots the frequency of each cell type in a region relative to what you would expect by chance.
# Visualise the enrichment of each cell type in each region
regionMap(cells, cellType = "clusters", limit = c(0.2, 5))
By default, these identified regions are stored in the region column in the colData of our object. We can quickly examine the spatial arrangement of these regions using ggplot.
# Extract cell information and filter to specific image.
df <- colData(cells) |>
as.data.frame() |>
filter(imageID == "Point2206_pt1116_31620")
# Colour cells by their region.
ggplot(df, aes(x = m.cx, y = m.cy, colour = region)) +
geom_point()
While much slower, we have also implemented a function for overlaying the region information as a hatching pattern so that the information can be viewed simultaneously with the cell type calls.
# Use hatching to visualise regions and cell types.
hatchingPlot(cells,
useImages = "Point2206_pt1116_31620",
cellType = "clusters",
spatialCoords = c("m.cx", "m.cy")
)
This plot is a ggplot object and so the scale can be modified with scale_region_manual.
# Use hatching to visualise regions and cell types.
# Relabel the hatching of the regions.
hatchingPlot(cells,
useImages = "Point2206_pt1116_31620",
cellType = "clusters",
spatialCoords = c("m.cx", "m.cy"),
window = "square",
nbp = 300,
line.spacing = 41) +
scale_region_manual(values = c(region_1 = 2,
region_2 = 1,
region_3 = 5,
region_4 = 4,
region_5 = 3)) +
guides(colour = guide_legend(ncol = 2))
If needed, we can again quickly use the colTest function to test for associations between the proportions of the cells in each region and progression status using either Wilcoxon rank sum tests or t-tests. Here we see an adjusted p-value less than 0.05 for region_4.
# Test if the proportion of each region is associated
# with progression status.
testRegion <- colTest(cells[,cellsToUse],
feature = "region",
condition = "Status")
testRegion
## W pval adjPval cluster
## region_4 460 0.0067 0.034 region_4
## region_5 220 0.1000 0.250 region_5
## region_3 330 0.6500 0.800 region_3
## region_2 280 0.6800 0.800 region_2
## region_1 290 0.8000 0.800 region_1
scFeatures is an R package available on https://github.com/SydneyBioX/scFeatures that generates multi-view representations of single-cell and spatial data through the construction of a total of 17 feature types. Here we use it to quantify the proportions of each cell type in each image as well as the average expression of each marker on each cell type. scFeatures outputs a list of two data.frames in this case.
# Use scFeatures to calculate proportions and the average marker abundance
# for each cell type.
data <- scFeatures(cells,
feature_types = c("proportion_raw", "gene_mean_celltype"),
sample = "imageID",
celltype = "clusters",
assay = "norm",
ncores = nCores )
## [1] "generating proportion raw features"
## [1] "generating gene mean celltype features"
names(data) <- c("prop", "mean")
imagesToUse <- rownames(clinical)[clinical[, "Status"]%in%c("nonprogressor", "progressor")]
# Test each marker-celltype for its association with progression.
test <- colTest(data$mean[imagesToUse,],
condition = clinical[imagesToUse, "Status"])
test |> head()
## W pval adjPval cluster
## cluster_19--PDL1 210 0.0016 0.69 cluster_19--PDL1
## cluster_3--PD1 200 0.0043 0.69 cluster_3--PD1
## cluster_6--PDL1 170 0.0053 0.69 cluster_6--PDL1
## cluster_5--P 460 0.0055 0.69 cluster_5--P
## cluster_19--P 460 0.0072 0.69 cluster_19--P
## cluster_17--P 460 0.0073 0.69 cluster_17--P
Our ClassifyR package, https://github.com/SydneyBioX/ClassifyR, formalises a convenient framework for evaluating classification in R. We provide functionality to easily include four key modelling stages (data transformation, feature selection, classifier training and prediction) in a cross-validation loop. Here we use the crossValidate function to perform 100 repeats of 5-fold cross-validation to evaluate the performance of an elastic net model applied to three quantifications of our MIBI-TOF data: cell type proportions, average marker expression of each cell type and region proportions.
# Add proportions of each region in each image
# to the list of dataframes.
data[["regions"]] <- getProp(cells, "region")
# Subset data images with progression status
measurements <- lapply(data, function(x)x[imagesToUse, ])
# Set seed
set.seed(51773)
# Perform cross-validation of an elastic net model
# with 100 repeats of 5-fold cross-validation.
cv <- crossValidate(measurements = measurements,
outcome = clinical[imagesToUse, "Status"],
classifier = "elasticNetGLM",
nFolds = 5,
nRepeats = 100,
nCores = nCores
)
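For readers unfamiliar with repeated cross-validation, the base R sketch below shows the general shape of the procedure on simulated data. Note that it swaps in an ordinary logistic regression (glm) in place of the elastic net, uses 10 repeats rather than 100, and the helpers aucSketch and repeatedCV are hypothetical; it is not how crossValidate is implemented.

```r
# AUC from the rank-sum (Mann-Whitney) formula; label is TRUE for the positive class.
aucSketch <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label)
  n0 <- sum(!label)
  (sum(r[label]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Repeated k-fold cross-validation of a logistic regression.
repeatedCV <- function(x, y, nFolds = 5, nRepeats = 10) {
  sapply(seq_len(nRepeats), function(i) {
    # Randomly assign each sample to a fold for this repeat.
    fold <- sample(rep(seq_len(nFolds), length.out = length(y)))
    pred <- numeric(length(y))
    for (k in seq_len(nFolds)) {
      train <- fold != k
      fit <- glm(y ~ ., family = binomial,
                 data = data.frame(y = y[train], x[train, , drop = FALSE]))
      pred[!train] <- predict(fit, newdata = x[!train, , drop = FALSE],
                              type = "response")
    }
    # One AUC per repeat, computed on the held-out predictions.
    aucSketch(pred, y == 1)
  })
}

# Simulated feature that is associated with a binary outcome.
set.seed(1)
n <- 60
x <- data.frame(region_4 = rnorm(n))
y <- rbinom(n, 1, plogis(1.5 * x$region_4))

aucs <- repeatedCV(x, y)
summary(aucs)
```

Repeating the fold assignment several times gives a distribution of AUC values rather than a single estimate, which is what the performance plot of the real analysis summarises.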
Here we use the performancePlot function to assess the AUC from each repeat of the 5-fold cross-validation. We see that the lisaClust regions appear to capture information which is predictive of progression status of the patients.
# Calculate AUC for each cross-validation repeat and plot.
performancePlot(cv,
performanceName = "AUC",
characteristicsList = list(x = "Assay Name"))
Here we have used a pipeline of our spatial analysis R packages to demonstrate an easy way to segment, normalise, cluster, quantify and classify high-dimensional in situ cytometry data all within R.